The white wine quality dataset from Cortez et al. (2009) is explored. The wines are classified as Vinho Verde and are produced exclusively in the demarcated Vinho Verde region of northwestern Portugal. They are described as possessing “vibrant freshness, elegance, lightness and aromatic and flavorful expressions.” The paper and data can be found here:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
The attributes included in the data set are described below (taken from “wineQualityInfo.txt”):
Based on the descriptions of Vinho Verde wines and the attributes in the data set, the features that are predicted to positively contribute to quality are:
The features that are predicted to negatively contribute to quality are:
The following analysis explores the attributes in a systematic manner. The features that mainly influence quality are then further investigated.
## 'data.frame': 4898 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ score : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality score
## Min. : 8.00 Min. :3.000 3: 20
## 1st Qu.: 9.50 1st Qu.:5.000 4: 163
## Median :10.40 Median :6.000 5:1457
## Mean :10.51 Mean :5.878 6:2198
## 3rd Qu.:11.40 3rd Qu.:6.000 7: 880
## Max. :14.20 Max. :9.000 8: 175
## 9: 5
There are 4898 white wines in the dataset with 11 numeric physicochemical features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, and “alcohol”). The “X” column is only a row index. The analysis of the dataset is centered on how these features are related to the “quality” of a wine. An extra variable, “score”, is a factor version of the “quality” feature with seven levels (3 to 9), which are the ratings actually observed. The underlying rating scale runs from 0 to 10 (best).
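The chunk that created “score” is not shown in this report. A minimal sketch of one way to build such a factor is given below. The small data frame is made up for illustration, and fixing the levels to 3 through 9 is an assumption based on the seven levels shown in the output above.

```r
# Toy stand-in for the wine data; the values are made up for illustration.
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.4, 12.9),
  quality = c(5L, 6L, 6L, 7L, 8L, 3L)
)

# Fix the levels to 3..9 (assumed from the output above) so that ratings
# with no wines in a subset still show up as zero counts in tables.
wines$score <- factor(wines$quality, levels = 3:9)

str(wines)
table(wines$score)
```

Passing `ordered = TRUE` to `factor()` would give an ordered factor instead, which matters for models that use the ordering of the levels.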
Most wines are rated in the middle of the scale (5-6), with a median “quality” of 6. Most features appear to have large outliers; “density” and “pH” might be exceptions, perhaps because these characteristics are easier to measure accurately than the other features. The median “residual.sugar” content is 5.2 \(g/dm^3\) and the median “alcohol” content is 10.40 vol.% (the mean is 10.51 vol.%).
The quality of the white wines appears to follow a roughly normal distribution. The scale is from 0 - 10, but the lowest score given was a 3 (20 wines) and the highest was a 9 (5 wines). Are there common profiles for the worst and best wines?
A set of base histograms is created for all attributes.
The plots above show the distribution of all of the features. The base histograms suggest that “fixed.acidity”, “citric.acid”, “total.sulfur.dioxide”, “density”, “pH”, and “sulphates” are roughly normally distributed, while “volatile.acidity”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, and “alcohol” have skewed distributions. However, binwidths and axes need adjustment in order to find any unexpected distributions. Histograms that provide significantly more information than the base histograms are presented below.
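The effect of bin width and axis scaling can be sketched with base R as follows. The data here are synthetic, right-skewed values standing in for a feature such as “residual.sugar”; the bin width of 0.5 and the log10 transform are illustrative choices, not necessarily the ones used for the plots in this report.

```r
set.seed(1)
# Synthetic right-skewed values (illustrative only, not the wine data).
x <- rlnorm(500, meanlog = 1.5, sdlog = 0.8)

# Default bins versus explicitly chosen, narrower bins of width 0.5.
h_default <- hist(x, plot = FALSE)
h_fine <- hist(x, breaks = seq(0, ceiling(max(x)), by = 0.5), plot = FALSE)

# A log10 transform spreads out the crowded low end of a skewed feature.
h_log <- hist(log10(x), breaks = 30, plot = FALSE)

# Compare the number of bins each choice produces.
length(h_default$counts)
length(h_fine$counts)
length(h_log$counts)
```

Dropping `plot = FALSE` draws the histograms. Narrow bins are what make features such as a spike at a single value visible.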
For citric acid, there appears to be a main peak at ~0.3 \(g/dm^3\) and a spike at ~0.5 \(g/dm^3\). It would be interesting to know which wines have a citric acid content of ~0.5 \(g/dm^3\).
There appears to be a bimodal distribution for residual sugar content. There are probably wines for people who prefer drier wines and for others who prefer sweeter wines. It would be interesting to know the properties of these two subsets.
There is a long tail of higher chloride concentrations for the lower quality wines.
The lower quality wines tend to have lower free sulfur dioxide concentrations.
The higher quality wines tend to have more alcohol content.
A set of box plots is created for all features. The data is limited to the middle 90%. The mean for each category is plotted as an “x”.
The features that appear to vary with wine quality are “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, and “alcohol”. The variation within a feature is more obvious in this set of box plots than in the previous set of histograms. The top features appear to be “density” and “alcohol”.
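A minimal sketch of the box plot preparation is shown below. It assumes that “the middle 90%” means keeping values between the 5th and 95th percentiles of the plotted feature, which is an interpretation rather than something the report states. The data are synthetic.

```r
set.seed(2)
# Synthetic data: alcohol loosely increasing with quality (illustrative only).
wines <- data.frame(quality = sample(5:7, 300, replace = TRUE))
wines$alcohol <- 8 + 0.4 * wines$quality + rnorm(300, sd = 1)

# Keep values between the 5th and 95th percentiles
# (assumed meaning of "middle 90%").
limits <- quantile(wines$alcohol, probs = c(0.05, 0.95))
middle <- wines[wines$alcohol >= limits[1] & wines$alcohol <= limits[2], ]

# Box plot statistics per quality level, computed without drawing.
bp <- boxplot(alcohol ~ quality, data = middle, plot = FALSE)

# Group means, which a box plot does not show by default.
group_means <- tapply(middle$alcohol, middle$quality, mean)
group_means
```

To draw the figure, call `boxplot(alcohol ~ quality, data = middle)` and then `points(seq_along(group_means), group_means, pch = 4)`, where `pch = 4` is the “x” symbol.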
Scatterplots and correlation calculations of the characteristics of wine that might be more closely associated to quality should help ascertain which features might be important to quality. Additionally, how the individual features correlate with each other will be investigated.
## volatile.acidity citric.acid residual.sugar
## volatile.acidity 1.00000000 -0.149471811 0.06428606
## citric.acid -0.14947181 1.000000000 0.09421162
## residual.sugar 0.06428606 0.094211624 1.00000000
## chlorides 0.07051157 0.114364448 0.08868454
## free.sulfur.dioxide -0.09701194 0.094077221 0.29909835
## total.sulfur.dioxide 0.08926050 0.121130798 0.40143931
## density 0.02711385 0.149502571 0.83896645
## pH -0.03191537 -0.163748211 -0.19413345
## alcohol 0.06771794 -0.075728730 -0.45063122
## quality -0.19472297 -0.009209091 -0.09757683
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## volatile.acidity 0.07051157 -0.0970119393 0.089260504
## citric.acid 0.11436445 0.0940772210 0.121130798
## residual.sugar 0.08868454 0.2990983537 0.401439311
## chlorides 1.00000000 0.1013923521 0.198910300
## free.sulfur.dioxide 0.10139235 1.0000000000 0.615500965
## total.sulfur.dioxide 0.19891030 0.6155009650 1.000000000
## density 0.25721132 0.2942104109 0.529881324
## pH -0.09043946 -0.0006177961 0.002320972
## alcohol -0.36018871 -0.2501039415 -0.448892102
## quality -0.20993441 0.0081580671 -0.174737218
## density pH alcohol quality
## volatile.acidity 0.02711385 -0.0319153683 0.06771794 -0.194722969
## citric.acid 0.14950257 -0.1637482114 -0.07572873 -0.009209091
## residual.sugar 0.83896645 -0.1941334540 -0.45063122 -0.097576829
## chlorides 0.25721132 -0.0904394560 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.29421041 -0.0006177961 -0.25010394 0.008158067
## total.sulfur.dioxide 0.52988132 0.0023209718 -0.44889210 -0.174737218
## density 1.00000000 -0.0935914935 -0.78013762 -0.307123313
## pH -0.09359149 1.0000000000 0.12143210 0.099427246
## alcohol -0.78013762 0.1214320987 1.00000000 0.435574715
## quality -0.30712331 0.0994272457 0.43557472 1.000000000
The “alcohol” feature has the strongest correlation with wine “quality” (0.44), and “density” has a moderate negative correlation with it (-0.31). “residual.sugar” presents a bimodal distribution and would not necessarily have a strong linear correlation with wine quality (-0.10). It will be explored further since “density” and “alcohol” both correlate strongly with “residual.sugar” (0.84 and -0.45). In fact, the correlation between “density” and “residual.sugar” is the strongest in this set. Similarly, “total.sulfur.dioxide” correlates with “density” (0.53), and “alcohol” correlates negatively with “chlorides” (-0.36) and “total.sulfur.dioxide” (-0.45). The relation of these variables with each other and with quality will be analyzed more closely.
There is a strong negative linear relationship between “density” and “alcohol”. As “alcohol” increases, “density” decreases. The higher quality wines tend to cluster in the lower right region, which holds the wines with higher alcohol content and lower “density”. It would be interesting to see if this ratio is a better feature. Additionally, there appears to be a clear separation between wines with higher “residual.sugar” and lower “residual.sugar”. The higher “residual.sugar” wines tend to have a higher density:alcohol ratio.
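A correlation matrix like the one above can be computed with `cor()`, which uses Pearson correlation by default. The sketch below runs on synthetic data whose column names mirror the wine data; the coefficients used to generate it are arbitrary and only meant to give a similar sign pattern.

```r
set.seed(3)
n <- 200
# Synthetic features (illustrative only, not the wine data).
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- pmax(0.6, 6 - 1.5 * (alcohol - 10.5) + rnorm(n, sd = 3))
density <- 0.994 - 0.002 * (alcohol - 10.5) +
  0.0004 * residual.sugar + rnorm(n, sd = 0.0005)
quality <- round(6 + 0.4 * (alcohol - 10.5) + rnorm(n, sd = 0.8))
wines <- data.frame(alcohol, residual.sugar, density, quality)

# Pearson correlation matrix of all numeric columns.
cors <- cor(wines)
round(cors, 2)

# Correlation of each feature with quality, sorted by absolute size.
with_quality <- cors[, "quality"]
with_quality <- with_quality[names(with_quality) != "quality"]
with_quality[order(abs(with_quality), decreasing = TRUE)]
```

Sorting by absolute value makes it easy to rank features regardless of the sign of the relationship.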
The “residual.sugar” decreases with increasing “alcohol”. However, there isn’t a strong linear relationship between “residual.sugar” and “alcohol” for the whole range. An exponential decay in “residual.sugar” appears to exist with increasing “alcohol”. The higher “quality” wines tend to have a higher “alcohol” and lower “residual.sugar” content.
There is a negative linear relationship between “total.sulfur.dioxide” and “alcohol”. As “alcohol” increases, “total.sulfur.dioxide” decreases.
There is a negative linear relationship between “chlorides” and “alcohol”. As “alcohol” increases, “chlorides” decreases.
The “residual.sugar” increases with increasing “density”. The relationship between “residual.sugar” and “density” appears to be more of an exponential growth. The higher “quality” wines tend to have a higher “residual.sugar”:“density” ratio.
There is a linear relationship between “total.sulfur.dioxide” and “density”. As “total.sulfur.dioxide” increases, “density” increases. This plot also shows that higher “quality” wines tend to have lower “density” values.
The previous scatterplots showed that the Density:Residual Sugar and Density:Alcohol ratios may be important transformed features. Here, the correlation of the ratios with quality is examined.
The correlation between “density”:“residual.sugar” and “quality” is 0.008996164, which is low and shows that this ratio is not an important feature for quality. The correlation between “density”:“alcohol” and “quality” is -0.4244115. This is a fairly strong negative correlation, but the positive correlation between “quality” and “alcohol” (0.4355747) is slightly larger in magnitude.
Now that the secondary attributes have been investigated, the medians of “alcohol” and “density” by quality are examined.
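The ratio features can be sketched as below. The data frame is a small made-up example, and the column names `density_per_alcohol` and `density_per_sugar` are hypothetical names chosen for this illustration.

```r
# Small made-up example (illustrative only, not the wine data).
wines <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 11.0, 12.0, 12.8),
  density        = c(1.001, 0.996, 0.995, 0.993, 0.991, 0.990),
  residual.sugar = c(20.7, 8.5, 6.9, 1.5, 2.0, 1.2),
  quality        = c(5, 5, 6, 6, 7, 8)
)

# Ratio features (hypothetical column names).
wines$density_per_alcohol <- wines$density / wines$alcohol
wines$density_per_sugar   <- wines$density / wines$residual.sugar

# Correlation of each ratio, and of plain alcohol, with quality.
r_ratio_alcohol <- cor(wines$density_per_alcohol, wines$quality)
r_ratio_sugar   <- cor(wines$density_per_sugar, wines$quality)
r_alcohol       <- cor(wines$alcohol, wines$quality)
c(r_ratio_alcohol, r_ratio_sugar, r_alcohol)
```

Comparing a ratio against its own components in this way shows whether the transformation adds anything beyond the raw feature.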
Medians of “alcohol” and “density” values may have linear relationships with “quality”. The “alcohol” medians provide a correlation of 0.8476837 with “quality”. The “density” medians provide a correlation of -0.8885051 with “quality”.
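The per-quality medians and their correlation with the quality level can be sketched as follows, using a small made-up data frame rather than the wine data.

```r
# Small made-up example with three quality levels (illustrative only).
wines <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.5, 11.2, 11.0, 12.0, 12.5),
  density = c(0.997, 0.996, 0.995, 0.995, 0.994, 0.993, 0.992, 0.991, 0.990)
)

# Median of each feature within each quality level.
meds <- aggregate(cbind(alcohol, density) ~ quality, data = wines, FUN = median)
meds

# Correlation between the quality level and the per-level medians.
r_alcohol <- cor(meds$quality, meds$alcohol)
r_density <- cor(meds$quality, meds$density)
c(r_alcohol, r_density)
```

Correlations computed on group medians use only one point per quality level, so the within-level spread is removed. They will usually be larger in magnitude than the correlations on the individual wines and should not be compared with them directly.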
For citric acid, the secondary peak is at 0.49 \(g/dm^3\), which is much greater than the average concentration found at every quality level. Citric acid supposedly adds ‘freshness’ and flavor to wines. There are 215 wines with this citric acid concentration, and the majority of them are mediocre wines with average sugar content, average alcohol content, and average density (compared to global averages).
For this part of the analysis, the wines are separated into two sets: wines with a residual sugar content of 4 \(g/dm^3\) or less, and wines with more than 4 \(g/dm^3\) residual sugar.
## volatile.acidity citric.acid residual.sugar
## 0.26782785 0.32875060 1.82293753
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.04411016 29.90295660 120.12398665
## density pH alcohol
## 0.99182044 3.21308059 11.00906056
The low “residual.sugar” subset has slightly higher correlation magnitudes between “quality” and most other attributes, on average. However, the magnitudes of the correlations among the other attributes are generally lower in this subset than in the full data set. The “residual.sugar” correlations are, surprisingly, much lower in this subset. The average attributes of this subset are also similar to the average attributes of the full data set. The exceptions are lower “residual.sugar”, “free.sulfur.dioxide”, and “total.sulfur.dioxide” values.
## volatile.acidity citric.acid residual.sugar
## 0.28603713 0.33826491 9.81165655
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.04701678 39.35469475 152.01374509
## density pH alcohol
## 0.99567963 3.16968940 10.14383434
This subset does not appear to be significantly different from the full data set (other than in “residual.sugar”), even though it contains only about 57% of the wines. This subset does have slightly higher means in “chlorides” and “free.sulfur.dioxide”.
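The split at 4 \(g/dm^3\) and the per-subset means can be sketched as below. The data frame is a small made-up example, so the numbers it prints are unrelated to the means reported above.

```r
# Small made-up example (illustrative only, not the wine data).
wines <- data.frame(
  residual.sugar = c(1.2, 1.6, 2.0, 3.9, 4.0, 6.9, 8.5, 12.0, 20.7),
  alcohol        = c(12.1, 11.5, 11.0, 10.8, 10.9, 10.1, 9.9, 9.4, 8.8),
  density        = c(0.990, 0.991, 0.992, 0.993, 0.993,
                     0.995, 0.996, 0.998, 1.001)
)

# Split at 4 g/dm^3: "4 or less" versus "more than 4".
low  <- subset(wines, residual.sugar <= 4)
high <- subset(wines, residual.sugar > 4)

# Share of wines in the high-sugar subset.
share_high <- nrow(high) / nrow(wines)
share_high

# Side-by-side feature means for the two subsets.
subset_means <- rbind(low = colMeans(low), high = colMeans(high))
subset_means
```

Using `<=` for one subset and `>` for the other guarantees that every wine lands in exactly one of the two sets.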
## volatile.acidity citric.acid residual.sugar
## 0.37598361 0.30770492 4.82103825
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.05055738 26.63387978 130.23224044
## density pH alcohol
## 0.99434306 3.18338798 10.17349727
The worst wines have higher “volatile.acidity” and “chlorides” and lower “citric.acid”, “residual.sugar”, and “free.sulfur.dioxide” than the average of all wines. The “quality” of the worst wines is more associated with “free.sulfur.dioxide” and “total.sulfur.dioxide”. In this subset, “alcohol” and “density” hardly correlate with “quality”. Perhaps this is because there are only two levels for “quality”.
## volatile.acidity citric.acid residual.sugar
## 0.27797222 0.32816667 5.62833333
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## 0.03801111 36.62777778 125.88333333
## density pH alcohol
## 0.99221439 3.22116667 11.65111111
The best wines have lower “chlorides” and more “alcohol”. As with the worst wines, “alcohol” and “density” hardly correlate with “quality” in this subset. This is probably because there are only two levels for “quality” in this subset as well.
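The worst and best subsets can be sketched as below. The cutoffs (quality of 4 or less, and 8 or more) are an assumption inferred from the remark that each subset has only two quality levels; the data frame is a small made-up example.

```r
# Small made-up example (illustrative only, not the wine data).
wines <- data.frame(
  quality = c(3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 8, 9),
  alcohol = c(10.2, 9.8, 10.5, 10.1, 9.6, 10.4,
              10.9, 11.3, 11.5, 12.0, 11.2, 12.4),
  free.sulfur.dioxide = c(15, 22, 30, 27, 35, 34, 38, 33, 36, 40, 31, 34)
)
features <- c("alcohol", "free.sulfur.dioxide")

# Assumed cutoffs: worst = quality 3-4, best = quality 8-9.
worst <- subset(wines, quality <= 4)
best  <- subset(wines, quality >= 8)

# Feature means for each subset next to the overall means.
rbind(all   = colMeans(wines[, features]),
      worst = colMeans(worst[, features]),
      best  = colMeans(best[, features]))

# Correlation with quality inside a subset that has only two quality levels.
r_worst <- cor(worst$alcohol, worst$quality)
r_best  <- cor(best$alcohol, best$quality)
c(r_worst, r_best)
```

With only two distinct quality values in a subset, the correlation reduces to a comparison between two groups, one of which may be very small, so weak or unstable values are to be expected.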
A linear model does not seem appropriate for predicting the quality of a wine since “quality” is a categorical variable. In fact, Cortez et al. (2009) use a support vector machine (SVM) to predict the quality of wine. The ultimate outcome of this analysis is to highlight the main features of Vinho Verde white wine quality.
The features that contribute most to Vinho Verde white wine “quality” are “alcohol” and “density”. “residual.sugar”, “citric.acid”, “chlorides”, and “free.sulfur.dioxide” may be the next most important features. These features differ from the predicted features (“citric.acid”, “total.sulfur.dioxide”, and “volatile.acidity”), which were based on the Vinho Verde description and the feature descriptions. The only predicted feature that proved to be important was “density”. Correlations between the features were examined and interesting subsets were further analyzed.
Alcohol content appears to influence wine “quality” positively, if the wine is at least mediocre (quality level of 5 and greater). Generally, as alcohol content increases, wine quality also increases. The top plot shows the range of the alcohol content for each “quality” level and the bottom plot shows how both the median and the mean of alcohol content increase with “quality”.
“density” and “alcohol” have the largest correlations with “quality”, but there is an underlying relationship between “density”, “alcohol”, and “residual.sugar”. The top plot shows how “density” decreases with increasing “alcohol” content. Additionally, there is a clear separation between higher and lower “residual.sugar” content. While “residual.sugar” and “density”:“residual.sugar” did not highly correlate with “quality”, this plot shows that there is a separation in “quality”. Since these three features are highly correlated, perhaps not all three features should be included in predicting wine “quality”.
This last plot shows that there are two subsets of wines by sugar content, for people who prefer sweeter wines or drier wines. The analysis found that, other than in sugar content, these subsets were not significantly different from the average of all wines.
The most difficult part of the analysis was recognizing that wine “quality” could not be treated as a continuous variable, like diamond price. At first, I was trying to find which feature transformations would lead to the highest correlation with “quality”. Since there are only seven “quality” categories, correlations do not tell the whole story. After this realization, the analysis became easier and I was able to ask interesting questions. I think finding the interesting subsets (the citric acid peak and the residual sugar subsets) will be useful for future analysis and for applying a machine learning algorithm.